Strengthen deobfuscation: lift spent string-array decoders, infer names, sharpen metrics #62
Merged
The synthetic javascript-obfuscator (obfuscator.io) fixtures in samples/generated/ gated correctness via manifest.json but were excluded from the readability metrics: both the live report binary and the committed SCOREBOARD.md read samples/ non-recursively. Add a per-profile rollup of the generated corpus (aggregated over all seeds, one row per obfuscation technique) to both surfaces, so the obfuscator.io samples count toward readability the same way they count toward correctness. kept% is byte-weighted, opaque% is the mean per-file ratio, rounds is the worst case, and converged flags any non-fixpoint.
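A minimal sketch of how such a per-profile rollup could aggregate per-file results, following the four rules stated above. It is written in JavaScript for illustration only. The record shape (inputBytes, outputBytes, opaqueRatio, rounds, converged) and the reading of kept% as total output bytes over total input bytes are assumptions, not the report binary's actual types.

```javascript
// Hypothetical per-file record:
//   { inputBytes, outputBytes, opaqueRatio, rounds, converged }
// Assumption: kept% = total output bytes / total input bytes for the profile.
function rollupProfile(files) {
  if (files.length === 0) {
    throw new Error("cannot roll up an empty profile");
  }
  let inputBytes = 0;
  let outputBytes = 0;
  let opaqueSum = 0;
  let rounds = 0;
  let converged = true;
  for (const file of files) {
    inputBytes += file.inputBytes;
    outputBytes += file.outputBytes;
    opaqueSum += file.opaqueRatio;             // summed, averaged below
    rounds = Math.max(rounds, file.rounds);    // worst case across seeds
    converged = converged && file.converged;   // any non-fixpoint clears it
  }
  return {
    keptPct: (100 * outputBytes) / inputBytes,   // byte-weighted
    opaquePct: (100 * opaqueSum) / files.length, // mean of per-file ratios
    rounds,
    converged,
  };
}
```

Byte-weighting lets a large file dominate kept%, while the opaque% mean gives every file equal weight, so the two columns can move differently for the same profile.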
rename: RenameByRole now infers meaningful names for array-iteration callback params (reduce->acc/value, map/filter/forEach->item/index, sort->left/right), C-style loop counters (->index), and catch bindings (->error), instead of falling back to generic varN. Names stay >=3 chars so they remain idempotent under the opaque-name guard, and reuse the existing scope de-duplicator.
report/golden: add a hexrefs column (raw, non-distinct count of _0x... identifier occurrences) to the live dashboard and committed scoreboard. opaque% counts DISTINCT tokens, so a single surviving decoder referenced N times barely moved it; hexrefs spikes when a string-array decoder is left intact, so the board now flags the worst failures (strarr_base64 163, strarr_rc4 211, numbers_keys 231, strong 385) instead of greenlighting them. Snapshots/scoreboard re-blessed.
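To illustrate why a raw occurrence count catches what a distinct-token count misses, here is a small JavaScript sketch. It assumes a hex-style identifier is _0x followed by hex digits and matches on raw text with a regular expression; the actual report presumably counts identifier tokens, and the names hexMetrics and HEX_IDENT are hypothetical.

```javascript
// Assumption: a hex-style identifier is `_0x` followed by hex digits, matched
// on raw source text (so occurrences inside strings or comments also count).
const HEX_IDENT = /\b_0x[0-9a-fA-F]+\b/g;

function hexMetrics(source) {
  const occurrences = source.match(HEX_IDENT) ?? [];
  return {
    hexrefs: occurrences.length,          // raw, non-distinct count
    distinct: new Set(occurrences).size,  // what a distinct-token metric sees
  };
}

// One surviving decoder, referenced from several call sites.
const survivingDecoder = [
  "function _0x4f2a(i) { return _0x1b3c[i]; }",
  "log(_0x4f2a(0), _0x4f2a(1), _0x4f2a(2), _0x4f2a(3));",
].join("\n");

const sample = hexMetrics(survivingDecoder);
// sample.hexrefs grows with every extra call site; sample.distinct stays at 2.
```

Each additional call site raises hexrefs by one while the distinct count does not move, which is the blind spot described above.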
…ional-chain members
dce: a string-array decoder's accessor memoizes through its own name
(if (f.flag===undefined){ f.cache={}; ... } ... f.cache[k] ...). After
every call site is inlined by decoder-lift, the only surviving references
to f are reads of its own properties inside its own body, which pinned
the spent decoder and its entire encoded string array alive forever.
fn_decl_is_dead now treats a function as dead when every resolved read of
its symbol is lexically inside its own body (shadowing-safe via reference
resolution), with the existing guard that a still-called self-reassigning
function is kept. Collapses the obfuscator.io string-array profiles:
strarr_rc4 kept 72%->19% (hexrefs 211->3), strarr_base64 68%->28%
(163->3); corpus output 328K->154K bytes, hexrefs 998->217.
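The memoization pattern described above can be reproduced in a few lines of JavaScript. This is an illustrative sketch, not javascript-obfuscator's actual output and not the pass's implementation: decode, table, and the offset-based isSelfOnlyReferenced model are hypothetical names, and the guard for still-called self-reassigning functions is left out.

```javascript
// Encoded string array, as a string-array obfuscation might emit it.
const table = ["aGVsbG8=", "d29ybGQ="];

// The accessor memoizes through its own name: every reference to `decode`
// inside this function sits lexically within the body of `decode` itself.
function decode(index) {
  if (decode.cache === undefined) {
    decode.cache = {};
  }
  if (decode.cache[index] === undefined) {
    decode.cache[index] = atob(table[index]);
  }
  return decode.cache[index];
}

// Before lifting, call sites keep `decode` alive from the outside.
const before = decode(0) + " " + decode(1);

// After lifting, call sites are plain literals, and the only remaining reads
// of `decode` are the self-references inside its own body.
const after = "hello" + " " + "world";

// Hypothetical model of the liveness rule: a function is dead when every
// resolved read of its symbol falls inside its own source span.
function isSelfOnlyReferenced(fn) {
  return fn.reads.every((offset) => offset >= fn.start && offset < fn.end);
}
```

Under this model a function with no reads at all is also dead, which matches ordinary dead-code elimination; the self-reference case simply extends it to reads that can never be reached from outside.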
member-normalize: a?.["foo"] parses as a ChainElement, not an
Expression, so optional-chained computed members were never normalized.
Added enter_chain_element to rewrite them to a?.foo (identifier keys
only, optional flag preserved). Covered by a new phase1 test.
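A text-level JavaScript sketch of the rewrite rule, assuming an ASCII-only identifier check. The real pass works on the AST through enter_chain_element; normalizeOptionalMember and IDENTIFIER are hypothetical names used only to show which keys qualify and that the optional flag survives.

```javascript
// Assumption: ASCII-only identifier check. Real identifier rules also allow
// Unicode letters and escape sequences.
const IDENTIFIER = /^[A-Za-z_$][A-Za-z0-9_$]*$/;

// Hypothetical text-level stand-in for the AST rewrite: emit dot notation for
// identifier keys and leave every other key in bracket form. The `?.` is
// kept in both branches.
function normalizeOptionalMember(objectText, key) {
  return IDENTIFIER.test(key)
    ? `${objectText}?.${key}`
    : `${objectText}?.[${JSON.stringify(key)}]`;
}

// The two spellings behave the same at runtime, including short-circuiting
// on a missing object.
const present = { foo: 1 };
const missing = undefined;
const sameWhenPresent = present?.["foo"] === present?.foo;
const sameWhenMissing = missing?.["foo"] === missing?.foo;
```

For example, normalizeOptionalMember("a", "foo") yields a?.foo, while a key such as "foo-bar" stays in bracket form because it is not an identifier.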
Snapshots/scoreboard re-blessed; full equivalence/determinism/corpus net green.
…-zft90c # Conflicts: # src/bin/report.rs # tests/golden.rs # tests/snapshots/SCOREBOARD.md
Acts on the gaps the obfuscator.io metrics exposed once the synthetic corpus was counted (PR #61). Five parallel workstreams, integrated and verified together.
Results (obfuscator.io corpus, by profile)
Corpus output shrank 328K → 154K bytes; the string-array profiles now deobfuscate to near the already-solved controlflow/deadcode/minimal baseline.

What changed
1. Decoder scaffolding cleanup (src/passes/dce.rs): the high-leverage fix. The Boa sandbox was already lifting string-array decoders correctly, but javascript-obfuscator's accessor memoizes through its own name (if (f.flag===undefined){ f.cache={} } … f.cache[k] …). After inlining every call site, the only surviving references to f were reads of its own properties inside its own body, which pinned the spent decoder and its entire encoded array alive. fn_decl_is_dead now treats a function as dead when every resolved read of its symbol is lexically inside its own body (shadowing-safe), keeping the existing guard for still-called self-reassigning functions.
2. Name inference (src/passes/rename.rs): RenameByRole now names array-iteration callback params (reduce→acc/value, map/filter/forEach→item/index, sort→left/right), C-style loop counters (index), and catch bindings (error) instead of generic varN. Names stay ≥3 chars so they remain idempotent under the opaque guard.
3. hexrefs metric (src/bin/report.rs, tests/golden.rs): opaque% counts distinct tokens, so a single decoder referenced N times barely moved it. hexrefs (raw _0x… occurrence count) spikes when a decoder survives intact, so the board now flags the worst failures instead of greenlighting them.
4. Optional-chain member normalization (src/passes/member_normalize.rs): a?.["foo"] parses as a ChainElement, not an Expression, so it was never normalized; added enter_chain_element to rewrite to a?.foo (identifier keys only).

Two investigations (no code change warranted)
v1…v999 naming to cut the size cost.

Known remaining residue
strong/numbers_keys keep some _0x refs inside opaque-predicate dead branches wrapped in object-method proxies (obj.m(obj.x, obj.x) ≡ "a"==="a"); collapsing those needs proxy inlining + predicate folding, a separate layer. The brackets metric also overcounts: most residual [" on sample_7/10/3 are array/object literals and decoder-gated base64 keys, not convertible member access.

Verification
Full slow net green: golden snapshots (re-blessed), sample_equivalence (behavior preserved on every real sample), generated_corpus (deobfuscated form reproduces manifest output on all 140), determinism (5), equivalence (102), phase1 (17). cargo clippy clean.

https://claude.ai/code/session_01EjhNTCU89wa5zaeRHMnfEc
Generated by Claude Code